MSDS 7331 - Lab 2 - Airline Satisfaction Dataset¶


Team - Triston Hudgins, Shijo Joseph, Osman Kanteh, Douglas Yip

The dataset chosen is a compilation of airline customer satisfaction surveys. The goal is to predict customer satisfaction based on the other dataset factors.¶

Source: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction?resource=download&select=test.csv

In [1]:
## Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import seaborn as sns
import plotly.express as px

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from matplotlib.pyplot import scatter
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, layout,XAxis, YAxis, Bar, Line
%matplotlib inline

##Decision tree setup
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

Data Preparation 1 ( 10 points total) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.¶

In [2]:
# load the airline satisfaction  dataset

df = pd.read_csv('https://raw.githubusercontent.com/dk28yip/MSDS7331/main/airline.csv') # read in the csv file
df.head()
#reduced samples set from 100,000 to 30,000 as a few of us had computer performance issues
df = df.sample(n=30000)

Check for NAs¶

In [3]:
# Any missing values in the dataset
def plot_missingness(df: pd.DataFrame=df) -> None:
    nan_df = pd.DataFrame(df.isna().sum()).reset_index()
    nan_df.columns  = ['Column', 'NaN_Count']
    nan_df['NaN_Count'] = nan_df['NaN_Count'].astype('int')
    nan_df['NaN_%'] = round(nan_df['NaN_Count']/df.shape[0] * 100,4)
    nan_df['Type']  = 'Missingness'
    nan_df.sort_values('NaN_%', inplace=True)

    # Add completeness
    for i in range(nan_df.shape[0]):
        complete_df = pd.DataFrame([nan_df.loc[i,'Column'],df.shape[0] - nan_df.loc[i,'NaN_Count'],100 - nan_df.loc[i,'NaN_%'], 'Completeness']).T
        complete_df.columns  = ['Column','NaN_Count','NaN_%','Type']
        complete_df['NaN_%'] = complete_df['NaN_%'].astype('int')
        complete_df['NaN_Count'] = complete_df['NaN_Count'].astype('int')
        nan_df = pd.concat([nan_df,complete_df], sort=True)
            
    nan_df = nan_df.rename(columns={"Column": "Feature", "NaN_%": "Missing %"})

    # Missingness Plot
    fig = px.bar(nan_df,
                 x='Feature',
                 y='Missing %',
                 title=f"Missingness Plot (N={df.shape[0]})",
                 color='Type',
                 opacity = 0.6,
                 color_discrete_sequence=['red','#808080'],
                 width=800,
                 height=800)
    fig.show()

plot_missingness(df)

print("Missing 99 values if the 'Arrival Delay in Minutes'column; approximately 0.31%.")
Missing 99 values if the 'Arrival Delay in Minutes'column; approximately 0.31%.

Remove unwanted columns¶

ID was removed from the dataset as it was used as a unique identified for each passenger

In [4]:
df["GenderNumeric"] = (df["Gender"]=="Male").astype(int)
df["CustomerTypeNumeric"] = (df["Customer Type"]=="Loyal Customer").astype(int)
df["TypeofTravelNumeric"] = (df["Type of Travel"]=="Personal Travel").astype(int)
df["ClassNumeric"] = df["Class"]
df["ClassNumeric"].replace(['Eco', 'Eco Plus', 'Business'], [0, 1, 2], inplace=True)

df["Arrival Delay in Minutes"]= df["Arrival Delay in Minutes"].fillna(0)


dfclean = df.drop(columns=['id'])

dfclean.isnull().sum() #double check on the missing values - 'arrival delay in minutes =310'
Out[4]:
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
GenderNumeric                        0
CustomerTypeNumeric                  0
TypeofTravelNumeric                  0
ClassNumeric                         0
dtype: int64
In [5]:
#Fill in missing values
dfclean["Arrival Delay in Minutes"].fillna(dfclean["Arrival Delay in Minutes"].median(), inplace=True)
In [6]:
dfclean.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Age 30000.0 39.333333 15.154087 7.0 27.0 40.0 51.0 85.0
Flight Distance 30000.0 1185.172233 996.070790 31.0 414.0 836.5 1726.0 4983.0
Inflight wifi service 30000.0 2.732333 1.326479 0.0 2.0 3.0 4.0 5.0
Departure/Arrival time convenient 30000.0 3.061733 1.523133 0.0 2.0 3.0 4.0 5.0
Ease of Online booking 30000.0 2.745933 1.394506 0.0 2.0 3.0 4.0 5.0
Gate location 30000.0 2.967167 1.272874 1.0 2.0 3.0 4.0 5.0
Food and drink 30000.0 3.207767 1.325993 0.0 2.0 3.0 4.0 5.0
Online boarding 30000.0 3.256067 1.346263 0.0 2.0 3.0 4.0 5.0
Seat comfort 30000.0 3.439733 1.317887 1.0 2.0 4.0 5.0 5.0
Inflight entertainment 30000.0 3.356600 1.332377 0.0 2.0 4.0 4.0 5.0
On-board service 30000.0 3.393267 1.291664 1.0 2.0 4.0 4.0 5.0
Leg room service 30000.0 3.342067 1.316098 0.0 2.0 4.0 4.0 5.0
Baggage handling 30000.0 3.625367 1.190461 1.0 3.0 4.0 5.0 5.0
Checkin service 30000.0 3.305067 1.265512 1.0 3.0 3.0 4.0 5.0
Inflight service 30000.0 3.643867 1.175875 1.0 3.0 4.0 5.0 5.0
Cleanliness 30000.0 3.293233 1.312036 0.0 2.0 3.0 4.0 5.0
Departure Delay in Minutes 30000.0 14.674167 37.877541 0.0 0.0 0.0 12.0 978.0
Arrival Delay in Minutes 30000.0 14.978233 37.996260 0.0 0.0 0.0 13.0 970.0
GenderNumeric 30000.0 0.490367 0.499916 0.0 0.0 0.0 1.0 1.0
CustomerTypeNumeric 30000.0 0.818067 0.385796 0.0 1.0 1.0 1.0 1.0
TypeofTravelNumeric 30000.0 0.313800 0.464044 0.0 0.0 0.0 1.0 1.0
ClassNumeric 30000.0 1.023467 0.962954 0.0 0.0 1.0 2.0 2.0
In [7]:
dfclean.corr()
Out[7]:
Age Flight Distance Inflight wifi service Departure/Arrival time convenient Ease of Online booking Gate location Food and drink Online boarding Seat comfort Inflight entertainment ... Baggage handling Checkin service Inflight service Cleanliness Departure Delay in Minutes Arrival Delay in Minutes GenderNumeric CustomerTypeNumeric TypeofTravelNumeric ClassNumeric
Age 1.000000 0.088761 0.010898 0.035966 0.017280 -0.008419 0.022636 0.209064 0.159351 0.069824 ... -0.050521 0.033044 -0.059903 0.056137 -0.010983 -0.013793 0.009616 0.276444 -0.045820 0.134797
Flight Distance 0.088761 1.000000 -0.000179 -0.028230 0.059409 0.006073 0.062344 0.210604 0.154774 0.128284 ... 0.065263 0.072926 0.053089 0.092888 0.007348 0.004873 0.002458 0.225608 -0.265108 0.451733
Inflight wifi service 0.010898 -0.000179 1.000000 0.347462 0.717276 0.331487 0.134376 0.455373 0.124499 0.209159 ... 0.120423 0.038001 0.110281 0.136155 -0.024201 -0.025128 0.011443 0.007496 -0.099762 0.030675
Departure/Arrival time convenient 0.035966 -0.028230 0.347462 1.000000 0.439865 0.437732 0.006044 0.070013 0.011336 -0.003539 ... 0.077063 0.095669 0.076152 0.017797 0.000105 -0.003089 0.014659 0.197410 0.265753 -0.102602
Ease of Online booking 0.017280 0.059409 0.717276 0.439865 1.000000 0.449523 0.037039 0.405594 0.035889 0.050755 ... 0.039327 0.012339 0.035587 0.022064 -0.010817 -0.012432 0.008634 0.015385 -0.122766 0.101451
Gate location -0.008419 0.006073 0.331487 0.437732 0.449523 1.000000 -0.003463 0.001483 -0.006495 -0.002177 ... -0.002530 -0.034383 -0.007300 -0.005852 0.006564 0.005248 -0.008748 -0.004155 -0.025673 0.002478
Food and drink 0.022636 0.062344 0.134376 0.006044 0.037039 -0.003463 1.000000 0.237150 0.575546 0.622686 ... 0.031572 0.083065 0.033219 0.654814 -0.034701 -0.038656 0.005760 0.058515 -0.067930 0.084525
Online boarding 0.209064 0.210604 0.455373 0.070013 0.405594 0.001483 0.237150 1.000000 0.421982 0.283006 ... 0.081116 0.191832 0.069611 0.328474 -0.034074 -0.036619 -0.036949 0.190336 -0.224619 0.322052
Seat comfort 0.159351 0.154774 0.124499 0.011336 0.035889 -0.006495 0.575546 0.421982 1.000000 0.604920 ... 0.073030 0.184772 0.067718 0.673116 -0.033966 -0.036365 -0.026255 0.160240 -0.122023 0.222807
Inflight entertainment 0.069824 0.128284 0.209159 -0.003539 0.050755 -0.002177 0.622686 0.283006 0.604920 1.000000 ... 0.372632 0.116432 0.406043 0.691391 -0.031402 -0.035593 0.004107 0.106310 -0.151017 0.188207
On-board service 0.047301 0.099231 0.112452 0.067752 0.032247 -0.029634 0.057255 0.142793 0.125524 0.417427 ... 0.514255 0.250339 0.552803 0.120820 -0.021523 -0.026300 0.010926 0.044113 -0.056015 0.197146
Leg room service 0.035710 0.127854 0.148875 0.011067 0.100626 -0.012060 0.024944 0.113150 0.100312 0.295708 ... 0.373361 0.152777 0.367549 0.090902 0.009297 0.004727 0.043565 0.050027 -0.127187 0.195486
Baggage handling -0.050521 0.065263 0.120423 0.077063 0.039327 -0.002530 0.031572 0.081116 0.073030 0.372632 ... 1.000000 0.239732 0.623748 0.095028 -0.009188 -0.014576 0.041854 -0.022627 -0.027709 0.161611
Checkin service 0.033044 0.072926 0.038001 0.095669 0.012339 -0.034383 0.083065 0.191832 0.184772 0.116432 ... 0.239732 1.000000 0.240392 0.173525 -0.021182 -0.024776 0.011548 0.035507 0.023392 0.150809
Inflight service -0.059903 0.053089 0.110281 0.076152 0.035587 -0.007300 0.033219 0.069611 0.067718 0.406043 ... 0.623748 0.240392 1.000000 0.088736 -0.052682 -0.058853 0.043555 -0.026731 -0.021220 0.147718
Cleanliness 0.056137 0.092888 0.136155 0.017797 0.022064 -0.005852 0.654814 0.328474 0.673116 0.691391 ... 0.095028 0.173525 0.088736 1.000000 -0.018905 -0.022607 0.003672 0.083864 -0.083686 0.134758
Departure Delay in Minutes -0.010983 0.007348 -0.024201 0.000105 -0.010817 0.006564 -0.034701 -0.034074 -0.033966 -0.031402 ... -0.009188 -0.021182 -0.052682 -0.018905 1.000000 0.958246 -0.003874 -0.007068 -0.010379 -0.010214
Arrival Delay in Minutes -0.013793 0.004873 -0.025128 -0.003089 -0.012432 0.005248 -0.038656 -0.036619 -0.036365 -0.035593 ... -0.014576 -0.024776 -0.058853 -0.022607 0.958246 1.000000 -0.007523 -0.005648 -0.011786 -0.014097
GenderNumeric 0.009616 0.002458 0.011443 0.014659 0.008634 -0.008748 0.005760 -0.036949 -0.026255 0.004107 ... 0.041854 0.011548 0.043555 0.003672 -0.003874 -0.007523 1.000000 0.029110 0.003404 0.008502
CustomerTypeNumeric 0.276444 0.225608 0.007496 0.197410 0.015385 -0.004155 0.058515 0.190336 0.160240 0.106310 ... -0.022627 0.035507 -0.026731 0.083864 -0.007068 -0.005648 0.029110 1.000000 0.311644 0.102477
TypeofTravelNumeric -0.045820 -0.265108 -0.099762 0.265753 -0.122766 -0.025673 -0.067930 -0.224619 -0.122023 -0.151017 ... -0.027709 0.023392 -0.021220 -0.083686 -0.010379 -0.011786 0.003404 0.311644 1.000000 -0.543666
ClassNumeric 0.134797 0.451733 0.030675 -0.102602 0.101451 0.002478 0.084525 0.322052 0.222807 0.188207 ... 0.161611 0.150809 0.147718 0.134758 -0.010214 -0.014097 0.008502 0.102477 -0.543666 1.000000

22 rows × 22 columns

In [8]:
f, ax = plt.subplots(figsize=[18, 13])
sns.heatmap(dfclean.corr(), annot=True, fmt=".2f", ax=ax, cmap="bwr")
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()

Very strong correlations (values from 0.8 to 1 or -0.8 to -1.0) Strong correlations (values from 0.6 to 0.8 or -0.6 to -0.8) Moderate correlations (values from 0.4 to 0.6 or -0.4 to -0.6)

Check distribution of the data¶

In [9]:
##distribution of the data

for column in dfclean:
    
    sns.displot(x=column, data=dfclean)
C:\Datascience\Anaconda3\envs\ML7331\lib\site-packages\seaborn\axisgrid.py:409: RuntimeWarning:

More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).


Data Preperation 2 (5 points total) Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).¶

In [10]:
print (dfclean.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 15146 to 77455
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Gender                             30000 non-null  object 
 1   Customer Type                      30000 non-null  object 
 2   Age                                30000 non-null  int64  
 3   Type of Travel                     30000 non-null  object 
 4   Class                              30000 non-null  object 
 5   Flight Distance                    30000 non-null  int64  
 6   Inflight wifi service              30000 non-null  int64  
 7   Departure/Arrival time convenient  30000 non-null  int64  
 8   Ease of Online booking             30000 non-null  int64  
 9   Gate location                      30000 non-null  int64  
 10  Food and drink                     30000 non-null  int64  
 11  Online boarding                    30000 non-null  int64  
 12  Seat comfort                       30000 non-null  int64  
 13  Inflight entertainment             30000 non-null  int64  
 14  On-board service                   30000 non-null  int64  
 15  Leg room service                   30000 non-null  int64  
 16  Baggage handling                   30000 non-null  int64  
 17  Checkin service                    30000 non-null  int64  
 18  Inflight service                   30000 non-null  int64  
 19  Cleanliness                        30000 non-null  int64  
 20  Departure Delay in Minutes         30000 non-null  int64  
 21  Arrival Delay in Minutes           30000 non-null  float64
 22  satisfaction                       30000 non-null  object 
 23  GenderNumeric                      30000 non-null  int32  
 24  CustomerTypeNumeric                30000 non-null  int32  
 25  TypeofTravelNumeric                30000 non-null  int32  
 26  ClassNumeric                       30000 non-null  int64  
dtypes: float64(1), int32(3), int64(18), object(5)
memory usage: 7.1+ MB
None

Summary of values to be used in classification modeling¶

A total of +100,000 passenger results are recorded in this data set. We have a combination of categorical, ordinal and continous variable in this dataset.

  • Gender:- Gender of the passengers (Female, Male) - Categorical Variable
  • Customer Type:- The customer type (Loyal customer, disloyal customer) - Categorical Variable
  • Age:- The actual age of the passengers - Continous Variable
  • Type of Travel:- Purpose of the flight of the passengers (Personal Travel, Business Travel) - Categorical Variable
  • Class:- Travel class in the plane of the passengers (Business, Eco, Eco Plus) - Categorical Variable
  • Flight distance:- The flight distance of this journey - Continous Variable
  • Inflight wifi service:- Satisfaction level of the inflight wifi service (0:Not Applicable;1-5) - Categorical Variable
  • Departure/Arrival time convenient:- Satisfaction level of Departure/Arrival time convenient - Categorical Variable
  • Ease of Online booking:- Satisfaction level of online booking - Categorical Variable
  • Gate location:- Satisfaction level of Gate location - Categorical Variable
  • Food and drink:- Satisfaction level of Food and drink - Categorical Variable
  • Online boarding:- Satisfaction level of online boarding - Categorical Variable
  • Seat comfort:- Satisfaction level of Seat comfort - Categorical Variable
  • Inflight entertainment:- Satisfaction level of inflight entertainment - Categorical Variable
  • On-board service:- Satisfaction level of On-board service - Categorical Variable
  • Leg room service:- Satisfaction level of Leg room service - Categorical Variable
  • Baggage handling:- Satisfaction level of baggage handling - Categorical Variable
  • Check-in service:- Satisfaction level of Check-in service - Categorical Variable
  • Inflight service:- Satisfaction level of inflight service - Categorical Variable
  • Cleanliness:- Satisfaction level of Cleanliness - Categorical Variable
  • Departure Delay in Minutes:- Minutes delayed when departure - Continous Variable
  • Arrival Delay in Minutes:- Minutes delayed when Arrival - Continous Variable
  • Satisfaction:- Airline satisfaction level(Satisfaction, neutral or dissatisfaction) - Categorical Variable

Modeling and Evaluation 1 (10 points total) Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.¶

What is the F1-measure?¶

  • F1 measurement consider the combination of both precision and recall to computed a models performance. Interpretation of the F1 score is a weighted average of the precision and recall values. When F1 score reaches 1, it is consider to have the best probable model performance while 0 would define it the worst.

What does it mean in our model?¶

1) If the model predicts unsatisfied/neutral customers as satisfied customers (high false negatives), the recall probability of our model would be low, and from an airline stand point they could be losing revenue assuming that all things are all right. We want the model to pin point unsatisfied/neutral customers to ensure the company focuses on the right priorities to maximize revenues and profits. As a result of this, we want a high recall score as we do not want our model to label a unsatisfied/neutral customers as satisfied customers.

2) If the model predicts lots of satisfied customers as unsatisfied/neutral customers (high false positives), the precision probability of our model would be low. This may result in airlines prioritizing investments into intiatives to improve satisfaction when they don't need to resulting in possible lower profies. As a a reult of this, we want a high precision score to optimize the airlines profits to avoid wasted investments in which we want the model not to label satisfied customers as a unsatisfied/neutral customers.

Given an airline wants to maximize revenues and profits through customer satification, the model requires to have high precision and recall score. As such, the measure of the F1 score suffices to manage model performance as both these metrics are contained in this measure. To achieve the most optimal model, our model should have the highest F1 score


Modeling and Evaluation 2 (10 points total) Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.¶

Let's check our dataset to see if we have an unbalanced dataset¶

Before we determine how we train our data, we are going to check our dataset to see if we have an unbalanced dataset

In [11]:
print(dfclean["satisfaction"].value_counts())

fig = plt.figure(figsize=(10, 5))
dfclean.groupby('satisfaction').size().plot(kind='pie',
                                       y = "satisfaction",
                                       label = "Type",
                                       autopct='%1.0f%%')
neutral or dissatisfied    17045
satisfied                  12955
Name: satisfaction, dtype: int64
Out[11]:
<AxesSubplot:ylabel='Type'>
In [12]:
#chang output to 1-Satisfied and 0-neutral/unsatisfied
dfclean["satisfaction"] = dfclean["satisfaction"].apply(lambda x: 1 if x == "satisfied" else 0)

Method to train and test model¶

We will use repeated (10 times) 10-fold cross validation for our analysis. The reason why we selected this method is enables us to utilize the whole data set to find the best training model for our dataset. Below are a few other reasons why we selected a 10 fold CV vs just a singl test train data set. 1) More metrics - We get to know more about the model and our underlining assumptions. Espicially given that we don't have domain knowledge of the data, this will inform us better about the data itself 2) Parameters Fine-Tuning - CV allows us to fine tune our model. Selecting the parameters of each model and optimizing those parameters 3) Avoids overfitting - CV runs the model mulitple times. Since 80/20 splits, we assume that our examples are independent. That means that by knowing/seeing some instance will not help us understand other instances. Unfortunately with large datasets this may not be the case and may result in overfitting.

The rest of our lab, we will be using cross_val_score from sk_learn for chosen models.

In [13]:
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
In [14]:
from sklearn.model_selection import RepeatedStratifiedKFold
x = dfclean.drop(["satisfaction", "Class", "Gender", "Customer Type", "Type of Travel"],axis=1)
y = dfclean['satisfaction']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)

Modeling and Evaluation 3 (20 points total) Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!¶

Modeling and Evaluation 4 (10 points total) Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.¶

Logistic Regresion¶

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

lr_clf = LogisticRegression(penalty='l2', C=1, class_weight=None, solver='liblinear' ) # get object

#calculate logistic regression model average f1 scores of cross validation
lr_scores = cross_val_score(lr_clf, x, y, cv=cv)
print('Logsitics Regression of F1 Score of repeated (10 times) 10-fold cross validation: %.3f (%.3f)' % (mean(lr_scores), std(lr_scores)), "\n\n")
Logsitics Regression of F1 Score of repeated (10 times) 10-fold cross validation: 0.876 (0.006) 


In [17]:
# check how C changes with accruacy prediction for logistic regression
from sklearn.metrics import accuracy_score

scl_obj = StandardScaler()
scl_obj.fit(x)

x_scaled = scl_obj.transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=.2)


accuracy, params = [], []
for c in np.arange(1, 20):
    log_linear = LogisticRegression(penalty='l2', C= c*0.1, class_weight=None, solver='liblinear')
    log_linear.fit(x_train,y_train)
    y_hat = log_linear.predict(x_test)
    
    accuracy.append(accuracy_score(y_test, y_hat))
    params.append(c)
accuracy = np.array(accuracy)
plt.plot(params, accuracy)
plt.ylabel('accuracy of prediction')
plt.xlabel('C')
plt.legend(loc='upper left')
#plt.xscale('log')
plt.show()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
In [18]:
from sklearn.model_selection import GridSearchCV

parameters = {'C':[1, 2, 5, 10, 20, 50]}
log_reg_model = LogisticRegression(max_iter=50000,penalty='l2',class_weight=None,solver='liblinear')
cv_grid = GridSearchCV(log_reg_model, parameters)
cv_grid.fit(x_train, y_train)
cv_grid.best_params_
Out[18]:
{'C': 1}

Our sensitivity analysis on C on both plotting and Gridsearch showed that c=1 yielding the best accuracy to the model. Hence Logistic Regression model used c=1 for this lab.

Decision Tree¶

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
DT_model = DecisionTreeClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)
In [20]:
#calculate decision tree model average f1 scores of cross validation
DT_scores = cross_val_score(DT_model, x, y, scoring='f1_macro', cv=cv)
print('Decision Tree F1 Score of repeated (10 times) 10-fold cross validation: %.3f (%.3f)' % (mean(DT_scores), std(DT_scores)), "\n\n")
Decision Tree F1 Score of repeated (10 times) 10-fold cross validation: 0.935 (0.005) 


KNN¶

In [30]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
KNN_model = KNeighborsClassifier(n_neighbors=5)
In [22]:
x_scaled = x
scaler = StandardScaler()
scaler.fit(x_scaled)
x_scaled = scaler.transform(x_scaled)
In [23]:
knn_scores = []
for i in list(range(1,10)):
    knn_loop_model = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(knn_loop_model, x_scaled, y, scoring='f1_macro', cv=cv)
    knn_scores.append(mean(scores))
In [24]:
#graph the f1 scores of the different knns
sns.set()
k_scores = pd.DataFrame()
k_scores['k'] = list(range(1, 10))
k_scores['score'] = knn_scores
sns.scatterplot(data=k_scores, x='k', y='score').set(title='K Values and Scores')
Out[24]:
[Text(0.5, 1.0, 'K Values and Scores')]

We observed that odd number KNN values to be more optimal and observed that K=5 to be most optimal. As such we used K=5 for the KNN model selection

In [31]:
#calculate knn model average f1 scores of cross validation
knn_scores = cross_val_score(KNN_model, x_scaled, y, scoring='f1_macro', cv=cv)
print('F1 Score: %.3f (%.3f)' % (mean(knn_scores), std(knn_scores)))
F1 Score: 0.919 (0.005)

Modeling and Evaluation 5 (10 points total) Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.¶

Logistic Regression¶

  • Pros:

    • Makes no assumptions about distributions of classes in feature space
    • Easier to implement, interpret, and very efficient to train
    • Easier to intepret model coefficients as indicators of feature importance
  • Cons:

    • Constructs linear boundaries
    • Non-linear problems can’t be solved with logistic regression because it has a linear decision surface

Decision tree¶

  • Pros:

    • Easy to set parameters of the branches
    • Does not require scaling of the data
    • Does not require normalization of data
    • Quantifies values of outcomes and probabilities
  • Cons:

    • Industry stand point it takes more resources (time and money) to complete
    • Moderate time require to train the model of large data size
    • Setup of dataset is apperative as parameter changes can results in different outcomes

KNN¶

  • Pros:
    • No training periods as it only learns training set at the time of predictions
    • Easy to implement with K and the distance function
  • Cons:
    • Requires scaling of features
    • Requires significant resources for large datasets
In [32]:
import math

#count total samples
dfclean_count=len(dfclean)

#1.96 is the z-score at a 95% CI interval

#Logisitic Regression 95% Confidence Interval
errorLR = 1 - mean(lr_scores)
varLR = ((1-errorLR) * errorLR)/dfclean_count
sqrtNumLR = math.sqrt(varLR)
LRUpper = (errorLR) + (1.96 * sqrtNumLR)
LRLower =(errorLR) - (1.96 * sqrtNumLR)

#KNN 95% Confidence Interval
errorKNN = 1 - mean(knn_scores)
varKNN = ((1-errorKNN) * errorKNN)/dfclean_count
sqrtNumKNN = math.sqrt(varKNN)
KNNUpper = (errorKNN) + (1.96 * sqrtNumKNN)
KNNLower =(errorKNN) - (1.96 * sqrtNumKNN)

#Decision tree 95% Confidence Interval
errorDT = 1 - mean(DT_scores) 
varDT = ((1-errorDT) * errorDT)/dfclean_count
sqrtNumDT = math.sqrt(varDT)
DTUpper = (errorDT) + (1.96 * sqrtNumDT)
DTLower =(errorDT) - (1.96 * sqrtNumDT)

print('1) Decision Tree 95% Confident Interval: ',DTUpper, DTLower)
print('2) KNN 95% Confident Interval: ',KNNUpper, KNNLower)
print('3) Logistic Regression 95% Confident Interval: ',LRUpper, LRLower)
1) Decision Tree 95% Confident Interval:  0.06732967855094532 0.06176832222570802
2) KNN 95% Confident Interval:  0.08453001396923618 0.07834007951360225
3) Logistic Regression 95% Confident Interval:  0.127776835405402 0.12031649792793106

Based on the 95% confidence Intervals of the three models, the Decision Tree yielded the best 95% CI when compared to the other two models.

Modeling and Evaluation 6 (10 points total) Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.¶

Features Guide¶

  • Feature 0: Age
  • Feature 1: Flight Distance
  • Feature 2: Inflight wifi Service
  • Feature 3: Departure/Arrival time convenient
  • Feature 4: Ease of Online Booking
  • Feature 5: Gate Location
  • Feature 6: Food and Drink
  • Feature 7: Online Boarding
  • Feature 8: Seat Comfort
  • Feature 9: Inflight Entertainment
  • Feature 10: On-board Service
  • Feature 11: Leg room service
  • Feature 12: Baggage handling
  • Feature 13: Checkin Service
  • Feature 14: Inflight Service
  • Feature 15: Cleanliness
  • Feature 16: Departure Delay in Minutes
  • Feature 17: Arrival Delay in Minutes
  • Feature 18: Gender Numeric Val
  • Feature 19: Customer Type Numeric
  • Feature 20: Type of Travel Numeric
  • Feature 21: Class Numeric
In [27]:
dt_clf = DecisionTreeClassifier()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)

dt_clf.fit(x_train,y_train)
yhat = dt_clf.predict(x_test)
print ('accuracy:', mt.accuracy_score(y_test,yhat))

# get the importances
imp = dt_clf.feature_importances_

#print out importat features
DT_important = dt_clf.feature_importances_
for i,v in enumerate(DT_important):
	print('Feature: %0d, Score: %.5f' % (i,v))
accuracy: 0.9338333333333333
Feature: 0, Score: 0.02840
Feature: 1, Score: 0.02195
Feature: 2, Score: 0.18038
Feature: 3, Score: 0.00510
Feature: 4, Score: 0.00419
Feature: 5, Score: 0.01455
Feature: 6, Score: 0.00498
Feature: 7, Score: 0.35105
Feature: 8, Score: 0.01347
Feature: 9, Score: 0.04714
Feature: 10, Score: 0.00997
Feature: 11, Score: 0.01219
Feature: 12, Score: 0.01672
Feature: 13, Score: 0.03428
Feature: 14, Score: 0.01472
Feature: 15, Score: 0.01453
Feature: 16, Score: 0.00512
Feature: 17, Score: 0.00865
Feature: 18, Score: 0.00473
Feature: 19, Score: 0.03746
Feature: 20, Score: 0.14977
Feature: 21, Score: 0.02065

Based on the travel demographics, most surveys completed in the dataset were from business travelers. The random samples are consistently pulling almost double the amount of business travelers compared to recreational travelers. Our most influential factors were determined to be Inflight wifi service, online boarding, and traveler type

Deployment (5 points total) How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would your deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?¶

Airline industries would find this analysis interesting. It enables air line companies to be able to address where to allocate investments to maintain or improve there current customer satisifaction to ensure optimal revenues and profits. Although our data set was able to identify that business travelers, wifi services and online checking, additional data is required to not only know ones business but the competitive landscape. Adding the data on the airline company flown will help inform how one compares to their competition. Depending on the company strategy, a company can also assess whether the results from the customer survey are reflective of the company's missission or vision. Example. If you are discount airline, does one really care for the bells and whistles that an airline has to offer? We can used the data to focus on the attributes that matter to the company and use the data to validate their strategy. Customer survey data is very useful in assessing the health of the business with the customer and these models can help pin point where to go focus a company's attention.

Exceptional Work (10 points total) You have free reign to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?¶

Work embedded in work above.

In [ ]: